[Linux] Prevent GC from running during process teardown #57832
We should do this just before sending the signal to stop the thread, though; otherwise we start running code concurrently with the GC. We can resume the GC immediately afterwards.
I'm confused: the signal-handling thread doesn't touch the GC at all (I don't think it even gets adopted), and the exiting thread will run finalizers and such, so it's not GC-safe.
Tested this, and the suggestion from here also fixes the false-alarm issue.
LGTM. Could you make the same change on Windows so we keep those platforms consistent?
Seems like we don't call |
Oh, good point. Windows doesn't really have signals anyway, so it is not relevant there.
## Context

We send signal 15 (SIGTERM) to shut down our servers. We noticed that some of the servers that receive the termination signal segfault in the GC, which leads to false alarms in our internal monitors that track GC-related crashes.

## Hypothesis

We suspect this pathological case may be happening:

- The process receives signal 15, which is captured by the signal listener thread.
- The signal listener initiates the process teardown (e.g. through `raise`).
- IIRC this operation is not atomic on Linux, i.e. the kernel kills the threads gradually, so we may spend a few ms in a state where some of the threads in the system are alive and others have already been killed (this point needs some confirmation).
- With part of the process alive and part of it dead, we try to enter a GC, see a bunch of Julia data structures in an intermediate/corrupted state, and crash while running the GC.

## Mitigation

Since our main goal is to get rid of the GC crashes that happen around server shutdown, we believe it is sufficient to prevent the last bullet point, i.e. we prevent the system from starting a GC when we're about to kill the process, and we wait for any ongoing GC to finish.

Co-debugged with @kpamnany.

(cherry picked from commit e1e3a46)
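
To make the mitigation concrete, here is a minimal, language-neutral model of "forbid new collections and wait for any ongoing one before teardown". It is a sketch only, not the runtime change in this PR: `GCGate`, `collect`, `disable_and_wait`, and `shutdown` are hypothetical names, and a plain lock stands in for whatever synchronization the real collector uses.

```python
import threading


class GCGate:
    """Toy stand-in for a collector that can be shut off before teardown.

    Hypothetical model: the lock is held for the whole duration of a
    collection, so taking it means no collection is currently in flight.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._disabled = False
        self.collections = 0

    def collect(self, work=None):
        """Run one collection unless teardown has begun.

        Returns True if the collection ran, False if it was refused.
        """
        with self._lock:
            if self._disabled:
                return False
            if work is not None:
                work()  # stand-in for the actual mark/sweep work
            self.collections += 1
            return True

    def disable_and_wait(self):
        """Block until any ongoing collection finishes, then refuse new ones."""
        with self._lock:
            self._disabled = True


def shutdown(gate, teardown):
    """Model of the signal listener: close the GC gate, then tear down."""
    gate.disable_and_wait()
    teardown()  # e.g. re-raising the termination signal in the real runtime
```

In this model, a `collect()` that started before `shutdown()` always completes before `teardown` runs, and every `collect()` attempted afterwards returns `False` without touching any state, which is the property the mitigation relies on.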